Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Adds spatial indexing capabilities to Magellan.
You can now index points, polygons and other shapes in Magellan by ZOrder Curves.
A general abstraction for a Spatial Index is defined and a Z Order curve implementation is provided.
In particular, Geohashing is implemented as a special case of a Z Order curve with a bounding box of the globe.
example:
df.withColumn($"point" geohash 25) will geohash each Point(long, lat) into a geohash with a 5 character precision.
df.withColumn($"polygon" geohash 25) will return a List of all the geohashes with 5 character length that intersect the polygon.
Spatial Joins can be performed by first creating an index on each dataframe and doing a hash join on the index followed by applying the predicate.
As an example:
pointsdf.join(polygonsdf).where($"point" within $ "polygon") can be rewritten as$"point" within $ "polygon")
val indexedpointsdf = pointsdf.withColumn("index", $"point geohash 25).select('point, explode($"index").as("index"))
val indexedpolygonsdf = polygonsdf.withColumn("index", $"polygon" geohash 25).select('polygon, explode($"index").as("index"))
indexedpointsdf.join(indexedpolygonsdf, indexedpointsdf("index") === indexedpolygonsdf("index")).where(
achieves the same join, except it first creates a geohash index on each shape, then performs a hashjoin on the index followed by eliminating possible mismatches (where the geohash boundaries of the point do intersect the polygon, but the point itself does not) using the where clause.
This is usually faster than the default join in Magellan, which is a cross join.
However, constructing the spatial indices does take time so there is a tradeoff involved.
For sufficiently large datasets this join should outperform the naive cross join significantly.
A future Pull Request will add a join optimization based on this PR such that the user does not have to manually rewrite the join into a hashjoin and the optimizer does this automatically.